Workshop Day 3A | 2022-07-27 Jeffrey M. Girard | Pitt Methods
Model I
Data Verification
It’s always a good idea to start with verification
Check that your variables are the correct type
Configure your factors’ levels and labels
Establish ordinal factors’ ordering
Explicitly set your missing values to NA
Check variables’ extrema and distributions
Check for erroneous and outlying values
Check the shape of continuous distributions
Check the overlap of categorical levels
Data Verification Live Coding
# SETUP:library(tidyverse)library(datawizard)salaries <-read_csv("salaries.csv")# ==============================================================================# LESSON: First, set all your categorical variables to factorssalariessalaries <- salaries |>mutate(across(c(rank, discipline, sex), factor)) |>print()# ==============================================================================# LESSON: Then check the summary statistics for problemssummary(salaries)describe_distribution(salaries)data_tabulate(salaries, exclude = is.numeric)# ==============================================================================# LESSON: Finally, check for empty level-combinationssalaries |>group_by(rank, discipline, sex) |>summarize(n =n()) |>arrange(n)
Distribution Geoms
Variable distributions are critical in data analysis
What are the most and least common values?
What are the extrema (min and max values)?
Are there any outliers or impossible values?
How much spread is there in the variable?
What shape does the distribution take?
Visualization is a quick way to assess this
They can also communicate it to others
Distribution Live Coding
# SETUP: We will need tidyverse and an example datasetlibrary(tidyverse)salaries <-read_csv("salaries.csv")# ==============================================================================# USECASE: Creating histogramsggplot(salaries, aes(x = salary)) +geom_histogram()ggplot(salaries, aes(x = salary)) +geom_histogram(bins =20)ggplot(salaries, aes(x = salary)) +geom_histogram(binwidth =2)ggplot(salaries, aes(x = salary)) +geom_histogram(binwidth =2, color ="red", size =1)ggplot(salaries, aes(x = salary)) +geom_histogram(binwidth =2, color ="red", size =1, fill ="white")# ==============================================================================# USECASE: Creating density plotsggplot(salaries, aes(x = salary)) +geom_density()ggplot(salaries, aes(x = salary)) +geom_density(color ="red", size =1, fill ="white")# ==============================================================================# USECASE: Creating box plotsggplot(salaries, aes(x = salary)) +geom_boxplot()ggplot(salaries, aes(x = salary, y = rank)) +geom_boxplot(varwidth =TRUE)# ==============================================================================# USECASE: Creating bar plots to count categorical variablesggplot(salaries, aes(x = rank)) +geom_bar()# ==============================================================================# PITFALL: Don't try to create histograms for categorical variablesggplot(salaries, aes(x = rank)) +geom_histogram() #error
Correlations
Correlations\((r)\) quantify the strength of the linear relationship between two variables from \(-1\) to \(+1\)
\(r\rightarrow-1\): as \(x\) increases, \(y\) decreases
\(r\rightarrow0\): as \(x\) increases, \(y\) doesn’t change
\(r\rightarrow+1\): as \(x\) increases, \(y\) also increases
Correlations may be the focus of statistical inference or just useful descriptive summaries
We can easily estimate many different types of correlation coefficients in R
Correlations Live Coding
# SETUP: Load the used packages (if needed) and read in the example datasetlibrary(tidyverse)library(correlation)salaries <-read_csv("salaries.csv")# ==============================================================================# LESSON: The standard correlation test in Rcor.test(salaries$salary, salaries$yrs.since.phd)# ==============================================================================# TIP: Fancier correlations with the {correlation} package (from {easystats})correlation(salaries)# ==============================================================================# LESSON: Creating a graphical model plot (requires installing some packages)correlation(salaries) |>plot()# ==============================================================================# LESSON: Creating and plotting the correlation matrixcorrelation(salaries) |>summary()correlation(salaries) |>summary() |>plot()correlation(salaries) |>summary(redundant =TRUE) |>plot()# ==============================================================================# p-value adjustmentscorrelation(salaries, p_adjust ="none") # very liberalcorrelation(salaries, p_adjust ="bonferroni") # very conservativecorrelation(salaries, p_adjust ="holm") # recommended# ==============================================================================# LESSON: Other correlation methodscorrelation(salaries, method ="kendall") # rank correlationcorrelation(salaries, method ="blomqvist") # similar to kendall but bettercorrelation(salaries, method ="distance") # linear and nonlinear relationships
Comparing Two Groups
A fundamental task in science is comparing the centrality of two groups’ distributions
It is common to compare means or medians
Independent groups are separate
Comparisons are between subjects
Did students in New York or students in California spend more time on social media?
Dependent groups are paired
Comparisons are within subjects
Did the students in my class spend more time on social media during the winter or the summer?
Independent Groups Live Coding
# SETUP: Load the used packages (if needed) and read in the example datasetlibrary(tidyverse)library(statsExpressions)library(ggstatsplot)salaries <-read_csv("salaries.csv")# ==============================================================================# USECASE: Let's compare the salaries across the two disciplinestwo_sample_test(data = salaries, x = discipline, y = salary, type ="parametric")# ==============================================================================# LESSON: We can also estimate each group's means by swapping the functioncentrality_description(data = salaries,x = discipline, y = salary, type ="parametric")# ==============================================================================# LESSON: Nonparametric does not assume a normal distribution of ytwo_sample_test(data = salaries, x = discipline, y = salary, type ="nonparametric")centrality_description(data = salaries, x = discipline, y = salary, type ="nonparametric")# ==============================================================================# TIP: Calculate the test results and generate a plot using {ggstatsplot}ggbetweenstats(data = salaries, x = discipline, y = salary, type ="parametric")ggbetweenstats(data = salaries, x = discipline, y = salary, type ="nonparametric")
Dependent Groups Live Coding
# SETUP: Load the used packages (if needed) and read in the example datasetlibrary(tidyverse)library(statsExpressions)library(ggstatsplot)interpersonal <-read_csv("interpersonal.csv")interpersonal# ==============================================================================# LESSON: Be sure to set paired to TRUE and provide a subject.idtwo_sample_test(data = interpersonal,x = time,y = problems,paired =TRUE,subject.id = patient,type ="parametric")centrality_description(data = interpersonal,x = time,y = problems,type ="parametric")# ==============================================================================# LESSON: There is a nonparametric version of this tootwo_sample_test(data = interpersonal,x = time,y = problems,paired =TRUE,subject.id = patient,type ="nonparametric")centrality_description(data = interpersonal,x = time,y = problems,type ="nonparametric")# ==============================================================================# TIP: No need to add paired because we are switching to ggwithinstats()ggwithinstats(data = interpersonal, x = time, y = problems,subject.id = patient,type ="parametric")ggwithinstats(data = interpersonal, x = time, y = problems,subject.id = patient,type ="nonparametric")
Comparing Many Groups
We may also compare more than two groups
Independent vs. dependent still applies
An omnibus test asks whether groups are different
If significant, one or more group is different
Posthoc tests ask how groups are different
Pairwise tests compare each pair of groups
They are often “gated” behind omnibus tests
With many groups, p-values are adjusted
Many Independent Live Coding
# SETUP: Load the used packages (if needed) and read in the example datasetlibrary(tidyverse)library(statsExpressions)library(ggstatsplot)salaries <-read_csv("salaries.csv")# ==============================================================================# USECASE: Let's compare the salaries across the three ranksoneway_anova(data = salaries,x = rank,y = salary,type ="parametric")centrality_description(data = salaries,x = rank,y = salary,type ="parametric")oneway_anova(data = salaries,x = rank,y = salary,type ="nonparametric")# ==============================================================================# USECASE: Let's compare the salaries between each pair of rankspairwise_comparisons(data = salaries,x = rank,y = salary,type ="parametric")pairwise_comparisons(data = salaries,x = rank,y = salary,type ="nonparametric")# ==============================================================================# TIP: Do it all (overall and pairwise tests) with ggbetweenstatsggbetweenstats(data = salaries, x = rank, y = salary,type ="parametric")ggbetweenstats(data = salaries, x = rank, y = salary,type ="nonparametric")
Many Dependent Live Coding
# SETUP: Load the used packages (if needed) and read in the example datasetlibrary(tidyverse)library(statsExpressions)library(ggstatsplot)elicitation <-read_csv("elicitation.csv")elicitation# ==============================================================================# USECASE: Let's compare the amusement self-reports across the four tasksoneway_anova(data = elicitation,x = Task,y = Amused,paired =TRUE,subject.id = Subject,type ="parametric")centrality_description(data = elicitation,x = Task,y = Amused,type ="parametric")oneway_anova(data = elicitation,x = Task,y = Amused,paired =TRUE,subject.id = Subject,type ="nonparametric")# ==============================================================================# USECASE: Let's compare the amusement self-reports between each pair of taskspairwise_comparisons(data = elicitation,x = Task,y = Amused,paired =TRUE,subject.id = Subject,type ="parametric")pairwise_comparisons(data = elicitation,x = Task,y = Amused,paired =TRUE,subject.id = Subject,type ="nonparametric")# ==============================================================================# TIP: Do it all (overall and pairwise tests) with ggwithinstatsggwithinstats(data = elicitation,x = Task,y = Amused,paired =TRUE,subject.id = Subject,type ="parametric")ggwithinstats(data = elicitation,x = Task,y = Amused,paired =TRUE,subject.id = Subject,type ="nonparametric")